Skip to contents

Modelling COVID-19 infection across England

In this tutorial we’ll cover some work on fitting a purely spatial Bayesian model to predict the COVID-19 infection rate across England.

Study aim and data description

This study describes how to fit a purely spatial Bayesian hierarchical model (BHM) based on Markov Chain Monte Carlo (MCMC) simulation method to estimate the spatial pattern of COVID-19 infection rate in England. The first thing is to load all the packages used in the COVID-19 case study.

Install CARBayes

This tutorial requires the CARBayes package, please install it before continuing through the tutorial.

Retrieving data

The study region is mainland England, which is partitioned into 6789 neighbourhoods at the Middle Layer Super Output Area (MSOA) scale. The infections data are the total reported number of COVID-19 cases in each MSOA from Jan 8, 2022 to March 26, 2022. The shapefile of the study region is a SpatialPolygonsDataFrame, which is used to map the data. It stores the location, shape and attributes of geographic features for the neighbourhoods.

We first load INLA and then retrieve the data from the fdmr example data store. We’ll use retrieve_tutorial_data to do this.

library(INLA)
fdmr::retrieve_tutorial_data(dataset = "covid_mcmc")
## 
## Tutorial data extracted to  /home/runner/fdmr/tutorial_data/covid_mcmc

The COVID-19 data and the related covariate information are included in our tutorial data package. We’ll load in the data using the load_tutorial_data function.

s_covid <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "s_covid.rds")

Next we’ll use the load_tutorial_data function to load in the spatial data we want.

sp_data <- fdmr::load_tutorial_data(dataset = "covid_mcmc", filename = "spatial_data.rds")

In this study, we use the areal unit modelling approach to fit the BHM and then make model inference using MCMC method. To do this, we need to construct a non-negative symmetric \(n \times n\) neighbourhood or adjacency matrix \(\boldsymbol{W}\) that accounts for the spatio-temporal autocorrelation structure, where \(n=6789\) is the number of areal units. The neighbourhood matrix specifies the spatial closeness between pairs of areal units. The elements \(\{w_{ij}\}\) in \(\boldsymbol{W}\) can be either continuous or binary, and a larger value of \(w_{ij}\) represents that MSOAs \((i,j)\) are spatially closer to each other. Here we use the border sharing specification, so that \(w_{ij}=1\) if MSOAs \((i,j)\) share a common geographical border, and \(w_{ij}=0\) otherwise.

W_nb <- spdep::poly2nb(sp_data, row.names = rownames(sp_data@data))
w <- spdep::nb2mat(W_nb, style = "B")

Model specification

We use a Bayesian hierarchical model to predict the spatial COVID-19 infection rate at the MSOA level in England. Let \(Y_{i}\) denote the total number of reported COVID cases for neighbourhood \(i=1,\ldots, n(=6789)\) during the study period, and \(N_{i}\) denote the (official) estimated population living in MSOA \(i\). \(Y_{i}\) is assumed to have a Poisson distribution with parameters (\(N_{i}\), \(\theta_{i}\)), where \(\theta_{i}\) is the true unobserved COVID-19 infection rate in MSAO \(i\). We follow a standard path in modelling \(\theta_{i}\) with a log link to the Poisson and start with a model where the linear predictor decomposes additively into a set of covariates and a Gaussian Markov Random Field process, which characterises the infection of the disease after the covariate effects have been accounted for. A general Bayesian hierarchical model commonly specified is given by

\[\begin{align} \nonumber Y_{i}\vert N_{i}, \theta_{i} &\sim \text{Poisson}(N_{i}\theta_{i}),\ \ i=1,\ldots,n,\\ log(\theta_{i} )&=\boldsymbol{x_{i}^{\top}}\boldsymbol{\beta}+\phi_{i}. \end{align}\]

The spatial random effects \(\{\phi_i\}\) are included in the model to account for any residual spatio-temporal autocorrelation after adjusting for covariates \(\boldsymbol{x_{i}}\). Here we utilise the spatial modelling structure “BYM”, proposed by Besag, York, and Mollié (1991), to model \(\{\phi_i\}\). It is given by

\[\begin{align} \nonumber \phi_i &=\phi_i^{(1)}+\phi_i^{(2)}\\ \phi_i^{(1)}\vert\boldsymbol\phi_{-i}^{(1)}&\sim \text{N}\left( \frac{\sum_{j=1}^{n}w_{ij}\phi_j^{(1)}}{\sum_{j=1}^{n}w_{ij}}, \frac{\tau_1^2}{\sum_{j=1}^{n}w_{ij}}\right)\\ \nonumber \phi_i^{(2)}&\sim \text{N}(0, \tau_2^2), \end{align}\]

where \(\phi_i\) now consists of two components. \(\phi_i^{(1)}\) is assigned the intrinsic CAR prior (Besag et al., 1991), and \(\phi_i^{(2)}\) is a set of independent and identically normally distributed random effects, with mean zero and common variance \(\tau_2^2\).

Define the model formula

In order to fit the spatial model, a model formula needs to be defined, by including the response in the left-hand side and the fixed and random effects in the right-hand side. First, we consider the scenario of including no covariates.

form <- total.cases ~ 1 + stats::offset(log(population))

Fit the model

Finally, we fit the spatial model using the function S.CARbym() of the package CARBayes developed by Lee (2013).

MCMC_model <- CARBayes::S.CARbym(
  formula = form,
  data = s_covid,
  family = "poisson",
  W = w,
  burnin = 10000,
  n.sample = 30000,
  thin = 10,
  verbose = F
)
## Error in loadNamespace(x): there is no package called 'CARBayes'

For comparison purpose, we fit a separate BHM to the same dataset using the INLA approach. Likewise, it uses the BYM to model the spatial random effects.

s_covid$ID <- seq(1, nrow(s_covid))
formula <- total.cases ~ 1 + f(ID,
  model = "bym",
  graph = w
)

INLA_model <- INLA::inla(formula,
  data = s_covid,
  family = "poisson",
  E = s_covid$population,
  control.compute = list(
    dic = TRUE,
    waic = TRUE,
    cpo = TRUE
  ),
  verbose = FALSE
)

Finally, we fit a BHM to the same dataset using the INLA-SPDE approach.

initial_range <- diff(range(sp_data@data[, "LONG"])) / 5
max_edge <- initial_range / 8

mesh <- INLA::inla.mesh.2d(
  loc = sp_data@data[, c("LONG", "LAT")],
  max.edge = c(1, 2) * max_edge,
  offset = c(initial_range / 4, initial_range),
  cutoff = max_edge / 7
)

prior_range <- initial_range
spde <- inla.spde2.pcmatern(
  mesh = mesh,
  prior.range = c(prior_range, 0.5),
  prior.sigma = c(1, 0.01)
)

s_covid_cp <- s_covid
sp::coordinates(s_covid_cp) <- c("LONG", "LAT")
cmp <- total.cases ~ 0 + Intercept + f(main = coordinates, model = spde)

inlabru_model <- inlabru::bru(cmp,
  data = s_covid_cp,
  family = "poisson",
  E = s_covid_cp$population,
  control.family = list(link = "log"),
  options = list(
    verbose = FALSE
  )
)
## Warning in add_mapper(component$main, label = component$label, lhoods = lh, : All covariate evaluations for 'Intercept' are NULL; an intercept component was likely intended.
##   Implicit latent intercept component specification is deprecated since version 2.1.14.
##   Use explicit notation '+ Intercept(1)' instead (or '+1' for '+ Intercept(1)').
## Warning in handle_problems(e_input): The input evaluation 'Intercept' for
## 'Intercept' failed. Perhaps the data object doesn't contain the needed
## variables? Falling back to '1'.

Model comparison

In terms of model selection criteria, we show the different values for each model based on DIC and WAIC. The model fitted using the INLA-SPDE approach performs better than the other two models in terms of the lowest DIC and WAIC values.

modfit <- data.frame(
  DIC = c(MCMC_model$modelfit[1], INLA_model$dic$dic, inlabru_model$dic$dic),
  WAIC = c(MCMC_model$modelfit[3], INLA_model$waic$waic, inlabru_model$waic$waic)
)
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
rownames(modfit) <- c("MCMC", "INLA_BYM", "INLA_SPDE")
## Error: object 'modfit' not found
modfit
## Error in eval(expr, envir, enclos): object 'modfit' not found

We compare the posterior COVID-19 infection rate estimates between the models. In general, all models provide similar posterior COVID-19 infection rate estimates.

mcmc_fitted_prev <- exp(mean(MCMC_model$samples$beta) + apply(MCMC_model$samples$psi, 2, mean))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mean': object 'MCMC_model' not found
mcmc_lc <- exp(quantile(MCMC_model$samples$beta, 0.025) +
  apply(MCMC_model$samples$psi, 2, quantile, 0.025))
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
mcmc_uc <- exp(quantile(MCMC_model$samples$beta, 0.975) +
  apply(MCMC_model$samples$psi, 2, quantile, 0.975))
## Error in eval(expr, envir, enclos): object 'MCMC_model' not found
comb <- data.frame(
  "INLA_BYM" = INLA_model$summary.fitted.values$mean,
  "INLA_lc" = INLA_model$summary.fitted.values$`0.025quant`,
  "INLA_uc" = INLA_model$summary.fitted.values$`0.975quant`,
  "INLA_SPDE" = inlabru_model$summary.fitted.values$mean[1:nrow(s_covid)],
  "INLA_SPDE_lc" = inlabru_model$summary.fitted.values$`0.025quant`[1:nrow(s_covid)],
  "INLA_SPDE_uc" = inlabru_model$summary.fitted.values$`0.975quant`[1:nrow(s_covid)],
  "MCMC" = mcmc_fitted_prev,
  "MCMC_lc" = mcmc_lc,
  "MCMC_uc" = mcmc_uc
)
## Error in eval(expr, envir, enclos): object 'mcmc_fitted_prev' not found
pairs(comb[, c("MCMC", "INLA_SPDE", "INLA_BYM")],
  pch = 19, cex = 0.3, col = "orange", lower.panel = panel.smooth
)
## Error in eval(expr, envir, enclos): object 'comb' not found
boxplot(comb[, c("MCMC", "INLA_SPDE", "INLA_BYM")])
## Error in eval(expr, envir, enclos): object 'comb' not found

The spatial patterns of the infection rate estimates for each model are displayed below.

sp_data@data$est.rate.mcmc <- mcmc_fitted_prev
## Error in eval(expr, envir, enclos): object 'mcmc_fitted_prev' not found
domain <- sp_data@data$est.rate.mcmc
legend_values <- sp_data@data$est.rate.mcmc

fdmr::plot_map(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
)

Map of the predicted infection rates for using MCMC.

sp_data@data$est.rate.inlabym <- INLA_model$summary.fitted.values$mean

domain <- sp_data@data$est.rate.inlabym
legend_values <- sp_data@data$est.rate.inlabym

fdmr::plot_map(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
)

Map of the predicted infection rates for using INLA-BYM.

sp_data@data$est.rate.inlaspde <- inlabru_model$summary.fitted.values$mean[1:nrow(s_covid)]

domain <- sp_data@data$est.rate.inlaspde
legend_values <- sp_data@data$est.rate.inlaspde

fdmr::plot_map(
  polygon_data = sp_data,
  domain = domain,
  palette = "Reds",
  legend_title = "Rate",
  add_scale_bar = TRUE,
  polygon_fill_opacity = 0.8,
)

Map of the predicted infection rates for using INLA-SPDE.

Now we consider the scenario of fitting the models with covariates of interest. We select several risk factors used in the COVID-19 tutorial. The models described above are fitted and compared again after incorporating covariate information. First we fit the BHM using MCMC method.

form <- total.cases ~ 1 + offset(log(population)) +
  IMD + perc.wb + perc.ba + age1 + pm25

MCMC_model2 <- CARBayes::S.CARbym(
  formula = form,
  data = s_covid,
  family = "poisson",
  W = w,
  burnin = 10000,
  n.sample = 30000,
  thin = 10,
  verbose = F
)
## Error in loadNamespace(x): there is no package called 'CARBayes'

Next we fit the BHM using the INLA-BYM approach.

formula <- total.cases ~ 1 + f(ID,
  model = "bym",
  graph = w
) + IMD + perc.wb + perc.ba + age1 + pm25

INLA_model2 <- INLA::inla(formula,
  data = s_covid,
  family = "poisson",
  E = s_covid$population,
  control.compute = list(
    dic = TRUE,
    waic = TRUE,
    cpo = TRUE
  ),
  verbose = FALSE
)

Next we fit the BHM using the INLA-SPDE approach.

cmp <- total.cases ~ 0 + Intercept + f(main = coordinates, model = spde) +
  IMD + perc.wb + perc.ba + age1 + pm25

inlabru_model2 <- inlabru::bru(cmp,
  data = s_covid_cp,
  family = "poisson",
  E = s_covid_cp$population,
  control.family = list(link = "log"),
  options = list(
    verbose = FALSE
  )
)
## Warning in add_mapper(component$main, label = component$label, lhoods = lh, : All covariate evaluations for 'Intercept' are NULL; an intercept component was likely intended.
##   Implicit latent intercept component specification is deprecated since version 2.1.14.
##   Use explicit notation '+ Intercept(1)' instead (or '+1' for '+ Intercept(1)').
## Warning in handle_problems(e_input): The input evaluation 'Intercept' for
## 'Intercept' failed. Perhaps the data object doesn't contain the needed
## variables? Falling back to '1'.

The regression coefficients of the selected covariates for all models are compared. In general, the models have similar regression coefficients estimates.

regr_est <- cbind.data.frame(
  "MCMC" = MCMC_model2$summary.results[1:(nrow(MCMC_model2$summary.results) - 2), 1],
  "INLA_BYM" = INLA_model2$summary.fixed[, 1],
  "INLA_SPDE" = inlabru_model2$summary.fixed[, 1]
)
## Error in eval(expr, envir, enclos): object 'MCMC_model2' not found
regr_est
## Error in eval(expr, envir, enclos): object 'regr_est' not found
Besag, Julian, Jeremy York, and Annie Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43: 1–20.
Lee, Duncan. 2013. “CARBayes: An r Package for Bayesian Spatial Modeling with Conditional Autoregressive Priors.” Journal of Statistical Software 55 (13): 1–24.